feat(runtime): add standalone edge runtime for Pi/Jetson deployment (#799)
Conversation
Break the edge runtime's dependency on runtimes/universal/ by copying and trimming the needed files into runtimes/edge/. The edge runtime is now fully self-contained with its own pyproject.toml, Dockerfile, and zero imports from universal.

Key changes from universal:
- models/__init__.py exports only 4 model types (was 12)
- vision router includes only detection/classification/streaming (no training, evaluation, tracking, OCR, or document extraction)
- chat_completions/service.py makes heavy utils optional (context summarizer, history compressor, tool calling, thinking)
- file_handler.py rewritten without PyMuPDF (no PDF processing)
- context_calculator.py makes torch import lazy for GGUF-only deploys
✅ All E2E Tests Passed!

This comment was automatically generated by the E2E Tests workflow.
Move 4 identical utility modules from both runtimes into llamafarm_common so bugs fixed in one place apply everywhere:
- utils/safe_home.py → llamafarm_common/safe_home.py
- utils/device.py → llamafarm_common/device.py
- utils/model_cache.py → llamafarm_common/model_cache.py
- utils/model_format.py → llamafarm_common/model_format.py

Both runtimes now have thin re-export shims that import from llamafarm_common, so all internal `from utils.X import Y` statements continue to work unchanged.

Also:
- Add cachetools dep to llamafarm_common (needed by model_cache)
- Consolidate pidfile.py to use safe_home instead of duplicating home directory resolution logic
- Fix model_format.py internal import to use relative import within the common package

Removes ~1,100 lines of duplicated code across the two runtimes.

Candidates identified but not moved (would add heavy deps to common):
- core/logging.py (needs structlog)
- services/error_handler.py (needs fastapi)
- models/base.py, vision_base.py (architectural scope change)
- All router files (deeply coupled to FastAPI app structure)
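The re-export shim pattern above can be sketched as follows. The commit message does not show the shim contents, so the symbol name `best_device` is a hypothetical stand-in, and in-memory modules are used here so the sketch runs without the actual llamafarm_common package:

```python
import sys
import types

# Stand-in for llamafarm_common.device: the single shared implementation.
common = types.ModuleType("llamafarm_common_device_demo")

def best_device() -> str:
    """Hypothetical shared helper; the real symbol names may differ."""
    return "cpu"

common.best_device = best_device
sys.modules[common.__name__] = common

# Stand-in for the thin shim utils/device.py: an explicit re-export
# (no `import *`), so existing call sites keep working unchanged.
shim = types.ModuleType("utils_device_demo")
shim.best_device = sys.modules["llamafarm_common_device_demo"].best_device
sys.modules[shim.__name__] = shim

# An unchanged call site: `from utils.device import best_device` style.
from utils_device_demo import best_device as shimmed

print(shimmed())  # cpu
```

A fix to `best_device` in the common module is then picked up by both runtimes, since the shims hold references rather than copies of the code.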
Add HailoYOLOModel that runs YOLO inference on the Hailo-10H AI accelerator using pre-compiled .hef models from the Hailo Model Zoo.

The server auto-detects Hailo hardware at startup:
- Checks for hailo_platform package
- Checks for PCI device ID 1e60 (Hailo-10H) via lspci
- Falls back to /dev/hailo0 device node
- Falls back to CPU/ultralytics if Hailo not available

New file: models/hailo_model.py
- HailoYOLOModel with same interface as YOLOModel
- Letterbox preprocessing for aspect-ratio-preserving resize
- NMS output parsing (Hailo .hef models include built-in NMS)
- COCO 80-class label mapping
- Configurable .hef directory via HAILO_HEF_DIR env var

Server changes:
- load_detection_model() selects backend based on hardware detection
- FORCE_CPU_VISION=1 env var to skip Hailo and force CPU
- hailo_platform import is fully optional (try/except)
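The detection chain described above might look roughly like this. The function name `detect_hailo` and the exact check order are assumptions reconstructed from the commit message, not the actual source:

```python
import importlib.util
import os
import shutil
import subprocess

def detect_hailo() -> bool:
    """Hedged sketch of the startup hardware auto-detection chain."""
    # Explicit opt-out takes priority.
    if os.environ.get("FORCE_CPU_VISION") == "1":
        return False
    # 1. The hailo_platform SDK must be importable at all.
    if importlib.util.find_spec("hailo_platform") is None:
        return False
    # 2. PCI vendor 1e60 (Hailo) via lspci; any non-empty output counts,
    #    since lspci resolves the ID to "Hailo Technologies Ltd.".
    if shutil.which("lspci"):
        out = subprocess.run(
            ["lspci", "-d", "1e60:"], capture_output=True, text=True
        ).stdout.strip()
        if out:
            return True
    # 3. Fall back to the device node.
    return os.path.exists("/dev/hailo0")

print(detect_hailo())  # typically False without Hailo hardware/SDK
```

If this returns False, the server would fall through to the CPU/ultralytics backend in load_detection_model().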
Build from the repo root with the -f flag so COPY can reach common/ and packages/. Use --no-sources to skip [tool.uv.sources] relative paths that don't apply inside the container.

Usage: docker build -t edge-runtime -f runtimes/edge/Dockerfile .
Two bugs prevented llama.cpp from loading on ARM64 Linux (Pi/Jetson):

1. Version mismatch: _get_llamafarm_release_version() read the llamafarm-llama package version (0.1.0), but the ARM64 binary is published under the main monorepo release tag (v0.0.28). These versions are decoupled. Now queries the GitHub API for the latest release, with an LLAMAFARM_RELEASE_VERSION env var override and v0.0.28 as a hardcoded fallback.

2. Extension mismatch: the manifest template used .tar.gz but the actual published asset is .zip. Fixed to match.
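The resolution order (env override → GitHub API → hardcoded fallback) can be sketched as below. The repo slug and function name are placeholders for illustration, not the real values:

```python
import json
import os
import urllib.request

FALLBACK_VERSION = "v0.0.28"

def get_llamafarm_release_version(repo: str = "owner/llamafarm") -> str:
    """Hedged sketch; `repo` is a placeholder slug, not the real one."""
    # 1. Explicit override wins.
    override = os.environ.get("LLAMAFARM_RELEASE_VERSION")
    if override:
        return override
    # 2. Otherwise ask GitHub for the latest release tag.
    try:
        url = f"https://api.github.com/repos/{repo}/releases/latest"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["tag_name"]
    # 3. On any failure, fall back to the known-good pinned tag.
    except Exception:
        return FALLBACK_VERSION

os.environ["LLAMAFARM_RELEASE_VERSION"] = "v0.0.30"
print(get_llamafarm_release_version())  # v0.0.30 (override path, no network)
```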
The pre-built llama.cpp ARM64 binary requires GLIBC 2.38+ but python:3.12-slim-bookworm only has GLIBC 2.36. Switch to ubuntu:24.04 (GLIBC 2.39) and install Python via apt. Also add --break-system-packages to uv pip install since Ubuntu 24.04 marks system Python as externally managed (PEP 668). This is safe inside a container.
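A minimal sketch of the resulting Dockerfile shape, assuming uv is already available in the image and with package names beyond python3 treated as assumptions:

```dockerfile
# GLIBC 2.39, new enough for the prebuilt llama.cpp ARM64 binary
# (python:3.12-slim-bookworm only ships GLIBC 2.36).
FROM ubuntu:24.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-venv curl \
    && rm -rf /var/lib/apt/lists/*

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so uv needs --break-system-packages; inside a container this is safe.
# (Assumes uv was installed in an earlier, omitted step.)
RUN uv pip install --system --break-system-packages --no-sources .
```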
Address CodeQL review comments:
- Remove `from module import *` from all 8 re-export shims (edge + universal). The explicit imports already cover everything needed.
- Remove unused `get_file_images` import from edge server.py
Dockerfile:
- Install vision extra (ultralytics, transformers) and pi-heif
- Add system libs for OpenCV (libgl1, libglib2.0-0, libxcb1)
- Set YOLO_AUTOINSTALL=false to prevent runtime pip installs

Vision hardening:
- Strip whitespace from base64 input before decoding (fixes newlines from curl piping with jq/base64 tools)
- Wrap PIL Image.open() with proper error handling in vision_base and detect_classify, so a clear error is returned instead of a raw traceback
- Pre-register HEIF plugin in yolo_model.py to prevent import errors on some ultralytics builds
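The whitespace-stripping fix can be sketched as below; the helper name is illustrative, not the actual vision_base API:

```python
import base64
import binascii

def decode_image_bytes(b64_payload: str) -> bytes:
    """Decode a base64 image payload, tolerating embedded whitespace."""
    # curl piping through jq/base64 tools often embeds newlines;
    # strip ALL whitespace before strict decoding.
    cleaned = "".join(b64_payload.split())
    try:
        return base64.b64decode(cleaned, validate=True)
    except binascii.Error as exc:
        # Surface a clear error instead of a raw traceback.
        raise ValueError(f"invalid base64 image payload: {exc}") from exc

# Newline-laden input, as produced by `base64` on most platforms:
print(decode_image_bytes("aGVs\nbG8=\n"))  # b'hello'
```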
…edge

- Re-export HfApi and _check_local_cache_for_model from the universal model_format shim so tests that mock utils.model_format.HfApi continue to work
- Add project.json for edge runtime so Nx can process the project graph without failing on the unnamed directory
…/rag

The model_format tests mocked utils.model_format.HfApi, but after the refactor to a re-export shim, detect_model_format lives in llamafarm_common.model_format and uses its own HfApi reference. Fix mock targets to patch at the source module.

The E2E source tests failed because UV_EXTRA_INDEX_URL and UV_INDEX_STRATEGY leaked from the CI environment into server/rag processes via os.Environ(). The PyTorch CPU index only has cp314 wheels for markupsafe, causing install failures on Python 3.12. Strip these vars from the base process environment so only services that explicitly declare them (universal-runtime) receive them.
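The env-scrubbing half of this fix amounts to the following pattern, shown here in Python for illustration (the variable names come from the commit message; the helper itself is hypothetical):

```python
import os

# Vars that must not leak into child service processes by default.
LEAKY_VARS = {"UV_EXTRA_INDEX_URL", "UV_INDEX_STRATEGY"}

def base_process_env() -> dict[str, str]:
    """Return the environment minus the UV index settings, so only
    services that explicitly re-declare them receive them."""
    return {k: v for k, v in os.environ.items() if k not in LEAKY_VARS}

os.environ["UV_INDEX_STRATEGY"] = "unsafe-best-match"  # simulate CI leakage
env = base_process_env()
print("UV_INDEX_STRATEGY" in env)  # False
```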
Review Summary by Qodo

Add standalone edge runtime for Pi/Jetson deployment with GGUF inference and KV cache management
Walkthrough

Description
• Introduces a fully self-contained edge runtime for Pi/Jetson deployment with zero dependencies on runtimes/universal/
• Implements GGUF language model inference via llama-cpp with memory-optimized quantized model support for edge devices
• Adds multi-tier KV cache management (VRAM → RAM → disk) with segment-level validation for efficient prefix caching
• Provides OpenAI-compatible chat completions service with optional heavy utilities (context summarizer, history compressor, tool calling)
• Implements vision routers for detection, classification, and streaming with Hailo-10H accelerator support
• Adds comprehensive context management with multiple truncation strategies (sliding_window, keep_system, middle_out, summarize)
• Introduces thinking/reasoning model support with budget allocation and chain-of-thought utilities
• Consolidates device detection and model format utilities into common llamafarm_common package for code reuse
• Includes GPU allocation with VRAM estimation and SSRF-safe remote cascade support
• Provides GGML logging integration and metadata caching to optimize performance on constrained hardware

Diagram
flowchart LR
A["Edge Runtime<br/>FastAPI Server"] --> B["GGUF Language<br/>Model"]
A --> C["Vision Models<br/>Detection/Classification"]
A --> D["KV Cache<br/>Manager"]
B --> E["llama-cpp<br/>Inference"]
B --> F["Context<br/>Calculator"]
D --> G["Multi-tier Cache<br/>VRAM/RAM/Disk"]
A --> H["Chat Completions<br/>Service"]
H --> I["Optional Utils<br/>Summarizer/Compressor"]
H --> J["Tool Calling<br/>& Thinking"]
C --> K["Hailo-10H<br/>Accelerator"]
L["Common Package"] -.->|Device Detection| A
L -.->|Model Format| A
File Changes
1. runtimes/edge/models/gguf_language_model.py
Code Review by Qodo
1. Undocumented UV_INDEX_STRATEGY env var
`lspci -d 1e60:` already filters by vendor ID, but the old check looked for "1e60" in the output text. lspci resolves vendor IDs to names, so the output contains "Hailo Technologies Ltd." instead of "1e60", causing detection to always fail. Check for any non-empty output instead.
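A self-contained illustration of why the old substring check failed; the sample lspci line below is invented for the demo:

```python
# Illustrative lspci output for a Hailo device: the vendor ID has been
# resolved to a name, so the raw "1e60" string never appears.
sample_output = (
    "0001:01:00.0 Co-processor: Hailo Technologies Ltd. Hailo-10H M.2 module"
)

print("1e60" in sample_output)      # old check: False -> detection always failed
print(bool(sample_output.strip()))  # new check: True -> device detected
```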
ConfiguredInferModel doesn't expose .output() in HailoRT 5.2.0. Use self._infer_model.output().shape instead of self._configured.output().shape to get the output buffer dimensions.
ConfiguredInferModel.run() does not accept timeout_ms in HailoRT 5.2.0.
HailoRT 5.2.0 accepts timeout as a positional arg, not keyword.
… bots

- Log warning on failed CPU offload instead of silent pass (base.py)
- Log debug on unified memory detection failure instead of silent pass (gguf_language_model.py)
- Remove unused _timing_start variable (gguf_language_model.py)
- Add path traversal validation in load_language() (server.py)
- Add path traversal validation in _read_gguf_metadata() (gguf_metadata_cache.py)
pydantic/ollama-action@v3 downloads a .tgz archive but Ollama v0.18.3 switched to .tar.zst, causing a 404. Use the official install script which handles format changes internally.
Download the install script from a pinned release tag and verify its SHA-256 before executing, addressing supply-chain risk from the unpinned curl|sh approach.
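The verify-before-execute step boils down to a digest comparison like the one below; the script bytes and digest here are demo placeholders, not the real Ollama values:

```python
import hashlib

def verify_script(script_bytes: bytes, expected_sha256: str) -> bool:
    """Return True only if the downloaded bytes match the pinned digest."""
    return hashlib.sha256(script_bytes).hexdigest() == expected_sha256

# Demo: a stand-in script and its correct digest.
data = b"#!/bin/sh\necho install\n"
good_digest = hashlib.sha256(data).hexdigest()

print(verify_script(data, good_digest))  # True: safe to execute
print(verify_script(data, "0" * 64))     # False: refuse to execute
```

Only after `verify_script` returns True would the script be handed to the shell, which removes the window where a compromised download host can inject arbitrary code via curl|sh.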
Replace bare pass with debug-level logging for BOS/EOS token decode failures. Behavior is unchanged but intent is documented and failures are observable with debug logging enabled.
Ollama v0.19.0 ships .tar.zst archives. The install script falls back to .tgz when zstd is missing, but .tgz no longer exists (404).
Replace bare pass with debug log when apply_chat_template fails, improving observability while preserving the fallback behavior.
Replace bare pass with debug log in the per-token think-tag detection hook. Keeps suppression to avoid breaking generation but makes failures observable.
Replace empty except with debug log when parsing n_ctx_train from GGUF metadata fields fails, consistent with other debug logging in this module.
Replace empty except with debug log when /proc/meminfo is unavailable, documenting the fallback to psutil.
The path traversal guard rejected all absolute paths, even when they resolved within ~/.llamafarm or cwd. Keep the .. rejection but let absolute paths through to the containment check.
The done-callback raises ValueError if stop() already cleared _pending_futures before the future completes.
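A sketch of a race-tolerant done-callback, using a set whose `discard` never raises (unlike `list.remove`, which raises ValueError when the item is already gone); the names are illustrative:

```python
import concurrent.futures

pending = set()  # stand-in for _pending_futures

def _on_done(fut):
    # discard() is a no-op if stop() already cleared the set,
    # so the callback can never raise on the race.
    pending.discard(fut)

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(lambda: 42)
    pending.add(fut)
    fut.add_done_callback(_on_done)
    pending.clear()       # simulate stop() racing the callback
    print(fut.result())   # 42; callback completed without raising
```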
Block Windows drive-style paths like C:/... that bypass the existing validation by containing a forward slash.
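Taken together, the two path-guard commits above suggest a check shaped like this; the helper name and allowed roots are assumptions reconstructed from the commit messages:

```python
import re
from pathlib import Path

# Roots the commits mention: ~/.llamafarm and the working directory.
ALLOWED_ROOTS = [Path.home() / ".llamafarm", Path.cwd()]

def is_safe_model_path(user_path: str) -> bool:
    """Hypothetical guard: reject traversal, allow contained absolutes."""
    # Block Windows drive-style prefixes like "C:/..." that would slip
    # past a POSIX-only check because they contain a forward slash.
    if re.match(r"^[A-Za-z]:[\\/]", user_path):
        return False
    # Still reject any ".." component outright.
    if ".." in Path(user_path).parts:
        return False
    # Absolute paths are fine as long as they resolve inside a root.
    resolved = Path(user_path).expanduser().resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in ALLOWED_ROOTS)

print(is_safe_model_path("C:/models/foo.gguf"))  # False
print(is_safe_model_path("../../etc/passwd"))    # False
```

The key change from the original guard is the last step: instead of rejecting every absolute path, it resolves the path and checks containment, so legitimate absolute paths under the allowed roots pass.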
The log_file argument was mistakenly removed. setup_logging does accept it, so LOG_FILE env var was silently ignored.
The concurrency refactor stopped updating self.class_names, causing get_model_info to report num_classes as 0.
Use functools.lru_cache to memoize _detect_hailo() instead of a module-level global variable, removing the unused global warning.
The install script appends ?version=v0.19.0 to download URLs when OLLAMA_VERSION is set, which causes the .tar.zst HEAD check to fail and falls back to .tgz (404). The pinned install script from the release provides sufficient version control.
Replace empty except with debug log for optional HEIF support.